GH-36411: [Python] Use scikit-build-core as build backend for PyArrow and get rid of setup.py#49259
raulcd wants to merge 22 commits into apache:main
Conversation
This comment was marked as off-topic.

@github-actions crossbow submit -g python

This comment was marked as outdated.

@github-actions crossbow submit wheel-*-cp313-cp313-amd64

This comment was marked as outdated.

@github-actions crossbow submit wheel-windows-cp313-cp313-amd64

Revision: 166ff63 Submitted crossbow builds: ursacomputing/crossbow @ actions-f1bbdf4eaa

@github-actions crossbow submit wheel-windows-cp313-cp313-amd64

Revision: 36fefd4 Submitted crossbow builds: ursacomputing/crossbow @ actions-bd811d95ea

@github-actions crossbow submit -g python -g wheel

This comment was marked as outdated.

@github-actions crossbow submit python-sdist

Revision: c518c90 Submitted crossbow builds: ursacomputing/crossbow @ actions-ea249e92ce

@github-actions crossbow submit python-sdist

Revision: 131e2c0 Submitted crossbow builds: ursacomputing/crossbow @ actions-1c58e78aff

@github-actions crossbow submit python-sdist

Revision: 30e04be Submitted crossbow builds: ursacomputing/crossbow @ actions-3179f97956

@github-actions crossbow submit -g python -g wheel

This comment was marked as outdated.

@github-actions crossbow submit wheel-windows-cp313-cp313-amd64

Revision: 5606749 Submitted crossbow builds: ursacomputing/crossbow @ actions-3c4293e220
WillAyd left a comment:
Nice work - some small comments but generally this lgtm
try:
    yield
finally:
    if sys.platform != "win32":
So these edits are made to the source tree directly, right? No way to do this in an isolated build location?

I am unsure how we could manage that. We want to support non-isolated builds so users can build against specific numpy/pandas versions; mandating isolated builds adds complexity to environment and dependency management. Copying ourselves to a different location just to build also creates a lot of complexity for a really simple thing: getting the licenses copied! Do you have any suggestions on what I could try?
To be fair, I've found the whole way build backends manage LICENSE/NOTICE files too rigid and inflexible, leaving cases like ours in a difficult spot. I have been thinking of just copying the files, but keeping them in sync is a maintenance burden. Maybe we should copy them and have a CI check that validates they are in sync?
I feel like a really simple build backend that just copies the data, either by turning the soft links into hard links or, on Windows, copying the files, is the simplest solution.
Yea I agree with your overall sentiment - sadly it's a lot of effort to maintain these two files :-)
I think the meson-python approach was pretty good, where a dist script copied these into the source distribution at the time it was being made. I'm wondering if scikit-build-core has a similar hook.
From what I understand of scikit-build-core, when building the sdist we can't use an out-of-source tree. Basically the sdist just copies the files that are under Path() (i.e. .):
https://github.com/scikit-build/scikit-build-core/blob/49b26c49fb4bc2a94177e4ef27c79e6389375710/src/scikit_build_core/build/sdist.py#L153-L161
There's no config setting that would let us build the sdist out of source so we could copy files in.
Have you raised the issue on the scikit-build-core issue tracker?

I have just asked about it:
…her we use build or pip
@github-actions crossbow submit -g python -g wheel

Revision: 0ada5d9 Submitted crossbow builds: ursacomputing/crossbow @ actions-5f312d0d53
  run: |
    gem install test-unit openssl
-   pip install "cython>=3.1" setuptools pytest requests setuptools-scm
+   pip install build "cython>=3.1" pytest requests scikit-build-core setuptools-scm
Can we rely on pyproject.toml for build dependencies at some point?

This is something that we are already facing (not strictly related to this PR). I've opened an enhancement for it:
set PYARROW_BUILD_TYPE=%CMAKE_BUILD_TYPE%
set PYARROW_BUILD_VERBOSE=1
set PYARROW_BUNDLE_ARROW_CPP=ON
set PYARROW_CMAKE_GENERATOR=%CMAKE_GENERATOR%
Does scikit-build-core pick up the CMAKE_xxx environment variables automatically?

Based on other changes you did, it doesn't seem it does?

It doesn't pick up the PYARROW_CMAKE_xxx ones (like PYARROW_CMAKE_OPTIONS), but it does honor the standard CMAKE_xxx ones. For example, I've tested CMAKE_BUILD_PARALLEL_LEVEL with both 1 and 16, and with -C build.verbose=true I can see Run Build Command(s): /usr/bin/ninja -v -j 1 or Run Build Command(s): /usr/bin/ninja -v -j 16 respectively.
As for CMAKE_BUILD_TYPE, because of the multi-config generator issue we should rely on -C cmake.build-type="${CMAKE_BUILD_TYPE:-Debug}", so scikit-build-core will either set CMAKE_BUILD_TYPE or pass --config at build time.
try:
    yield
finally:
    if sys.platform != "win32":
Have you raised the issue on the scikit-build-core issue tracker?
# as it's the most common build type for users building from source.
# This is mainly relevant for our Windows wheels, which are built with
# Visual Studio and thus use a multi-config generator with Release.
# As a note this is only to populate config_internal.h.cmake.
Hmm, it's not possible to populate config_internal.h.cmake separately for each config?

I have no idea how to do that with a multi-config generator, because it would have to be done at build time, not at configure time.
I guess we could stop using configure_file for the multi-config case, drop config_internal.h.cmake and generate the full file at build time instead?
@kou do you have any idea what we could do here?
BTW, this doesn't really work today for multi-config generators anyway: if we configure with CMAKE_BUILD_TYPE (which only populates the file and has no effect on the real build type) but then build with --config using a different build type, the result will be wrong.
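One possible direction (a sketch only, not what this PR implements, and the `ARROW_BUILD_TYPE` define name is illustrative): CMake's `file(GENERATE)` writes its output at generation time and can evaluate the `$<CONFIG>` generator expression, so a multi-config generator gets one header per configuration instead of a single configure-time file:

```cmake
# Sketch: emit a per-configuration header with file(GENERATE) instead of
# configure_file(). $<CONFIG> resolves separately for each configuration
# of a multi-config generator (e.g. Debug, Release under Visual Studio).
file(GENERATE
     OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/$<CONFIG>/config_internal.h"
     CONTENT "#define ARROW_BUILD_TYPE \"$<CONFIG>\"\n")

# Targets would then pick up the matching directory per config:
# target_include_directories(arrow_python PRIVATE
#     "${CMAKE_CURRENT_BINARY_DIR}/$<CONFIG>")
```

This sidesteps the mismatch described above, since the header always reflects the configuration actually being built rather than whatever CMAKE_BUILD_TYPE was at configure time.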
pyarrow = ["*.pxd", "*.pyi", "*.pyx", "includes/*.pxd", "py.typed"]
[tool.scikit-build.cmake.define]
PYARROW_BUNDLE_ARROW_CPP = {env = "PYARROW_BUNDLE_ARROW_CPP", default = "OFF"}
PYARROW_BUNDLE_CYTHON_CPP = {env = "PYARROW_BUNDLE_CYTHON_CPP", default = "OFF"}
Did you check that those were actually bundled in the wheels?

I did not but I have now 😄
PYARROW_BUNDLE_ARROW_CPP works. Building a wheel with PYARROW_BUNDLE_ARROW_CPP=OFF:
$ ls -lrth dist/pyarrow-24.0.0.dev164+g0ada5d92a-cp313-cp313-linux_x86_64.whl
-rw-rw-r-- 1 raulcd raulcd 6.9M Feb 26 13:25 dist/pyarrow-24.0.0.dev164+g0ada5d92a-cp313-cp313-linux_x86_64.whl
$ cd dist
$ unzip pyarrow-24.0.0.dev164+g0ada5d92a-cp313-cp313-linux_x86_64.whl
$ find . -iname libarrow*
./pyarrow/libarrow_python_flight.so.2400
./pyarrow/libarrow_python.so
./pyarrow/includes/libarrow_substrait.pxd
./pyarrow/includes/libarrow_feather.pxd
./pyarrow/includes/libarrow_acero.pxd
./pyarrow/includes/libarrow_python.pxd
./pyarrow/includes/libarrow_dataset.pxd
./pyarrow/includes/libarrow_dataset_parquet.pxd
./pyarrow/includes/libarrow_fs.pxd
./pyarrow/includes/libarrow.pxd
./pyarrow/includes/libarrow_cuda.pxd
./pyarrow/includes/libarrow_flight.pxd
./pyarrow/libarrow_python_flight.so
./pyarrow/libarrow_python_flight.so.2400.0.0
./pyarrow/libarrow_python.so.2400.0.0
./pyarrow/libarrow_python.so.2400

with PYARROW_BUNDLE_ARROW_CPP=ON:
$ ls -lrth pyarrow-24.0.0.dev164+g0ada5d92a-cp313-cp313-linux_x86_64.whl
-rw-rw-r-- 1 raulcd raulcd 162M Feb 26 13:34 pyarrow-24.0.0.dev164+g0ada5d92a-cp313-cp313-linux_x86_64.whl
# unzip
$ find . -iname libarrow*
./pyarrow/libarrow.so.2400
./pyarrow/libarrow_acero.so.2400
./pyarrow/libarrow_python_flight.so.2400
./pyarrow/libarrow_compute.so.2400
./pyarrow/libarrow_python.so
./pyarrow/libarrow_flight.so.2400
./pyarrow/includes/libarrow_substrait.pxd
./pyarrow/includes/libarrow_feather.pxd
./pyarrow/includes/libarrow_acero.pxd
./pyarrow/includes/libarrow_python.pxd
./pyarrow/includes/libarrow_dataset.pxd
./pyarrow/includes/libarrow_dataset_parquet.pxd
./pyarrow/includes/libarrow_fs.pxd
./pyarrow/includes/libarrow.pxd
./pyarrow/includes/libarrow_cuda.pxd
./pyarrow/includes/libarrow_flight.pxd
./pyarrow/libarrow_python_flight.so
./pyarrow/libarrow_dataset.so.2400
./pyarrow/libarrow_python_flight.so.2400.0.0
./pyarrow/libarrow_python.so.2400.0.0
./pyarrow/libarrow_python.so.2400

PYARROW_BUNDLE_CYTHON_CPP also works (see the .cpp files). Building a wheel with PYARROW_BUNDLE_CYTHON_CPP=OFF:
Archive: pyarrow-24.0.0.dev164+g0ada5d92a-cp313-cp313-linux_x86_64.whl
inflating: pyarrow/__init__.pxd
inflating: pyarrow/__init__.py
inflating: pyarrow/_acero.cpython-313-x86_64-linux-gnu.so
inflating: pyarrow/_acero.pxd
inflating: pyarrow/_acero.pyx
inflating: pyarrow/_azurefs.pyx
inflating: pyarrow/_compute.cpython-313-x86_64-linux-gnu.so
inflating: pyarrow/_compute.pxd
inflating: pyarrow/_compute.pyx
inflating: pyarrow/_compute_docstrings.py
inflating: pyarrow/_csv.cpython-313-x86_64-linux-gnu.so
inflating: pyarrow/_csv.pxd
inflating: pyarrow/_csv.pyx
inflating: pyarrow/_cuda.pxd
inflating: pyarrow/_cuda.pyx
inflating: pyarrow/_dataset.cpython-313-x86_64-linux-gnu.so
inflating: pyarrow/_dataset.pxd
inflating: pyarrow/_dataset.pyx
inflating: pyarrow/_dataset_orc.pyx
inflating: pyarrow/_dataset_parquet.pxd
inflating: pyarrow/_dataset_parquet.pyx
inflating: pyarrow/_dataset_parquet_encryption.pyx
...
$ ls -lrth dist/pyarrow-24.0.0.dev164+g0ada5d92a-cp313-cp313-linux_x86_64.whl
-rw-rw-r-- 1 raulcd raulcd 6.9M Feb 26 13:55 dist/pyarrow-24.0.0.dev164+g0ada5d92a-cp313-cp313-linux_x86_64.whl

with PYARROW_BUNDLE_CYTHON_CPP=ON:
inflating: pyarrow/_acero.cpp
inflating: pyarrow/_acero.cpython-313-x86_64-linux-gnu.so
inflating: pyarrow/_acero.pxd
inflating: pyarrow/_acero.pyx
inflating: pyarrow/_azurefs.pyx
inflating: pyarrow/_compute.cpp
inflating: pyarrow/_compute.cpython-313-x86_64-linux-gnu.so
inflating: pyarrow/_compute.pxd
inflating: pyarrow/_compute.pyx
inflating: pyarrow/_compute_docstrings.py
inflating: pyarrow/_csv.cpp
inflating: pyarrow/_csv.cpython-313-x86_64-linux-gnu.so
inflating: pyarrow/_csv.pxd
inflating: pyarrow/_csv.pyx
inflating: pyarrow/_cuda.pxd
inflating: pyarrow/_cuda.pyx
inflating: pyarrow/_dataset.cpp
inflating: pyarrow/_dataset.cpython-313-x86_64-linux-gnu.so
...
$ ls -lrth dist/pyarrow-24.0.0.dev164+g0ada5d92a-cp313-cp313-linux_x86_64.whl
-rw-rw-r-- 1 raulcd raulcd 11M Feb 26 14:26 dist/pyarrow-24.0.0.dev164+g0ada5d92a-cp313-cp313-linux_x86_64.whl
# unzip

[tool.scikit-build.cmake.define]
PYARROW_BUNDLE_ARROW_CPP = {env = "PYARROW_BUNDLE_ARROW_CPP", default = "OFF"}
PYARROW_BUNDLE_CYTHON_CPP = {env = "PYARROW_BUNDLE_CYTHON_CPP", default = "OFF"}
PYARROW_GENERATE_COVERAGE = {env = "PYARROW_GENERATE_COVERAGE", default = "OFF"}
At some point we should check whether this flag still works. Eons ago it was used to generate codecov reports, but we ditched all the corresponding scripting that had to merge coverage files generated during the C++ test suite and the Python test suite.
Can you open an issue for that?
@github-actions crossbow submit wheel*-cp313-* python-sdist

Revision: 05953fa Submitted crossbow builds: ursacomputing/crossbow @ actions-0bd37a6444

@github-actions crossbow submit wheel-windows-cp313-*

Revision: 5b96076 Submitted crossbow builds: ursacomputing/crossbow @ actions-51187e0471
Rationale for this change
Move the PyArrow build backend from setuptools with a custom setup.py to scikit-build-core, a build backend for CMake-based projects.
What changes are included in this PR?
Move from setuptools to scikit-build-core and remove PyArrow's setup.py. Update some of the build requirements and apply minor fixes.
A custom build backend has also been created to wrap scikit-build-core in order to fix problems with license files in monorepos.
pyproject.toml metadata validation expects license files to exist before the build backend is exercised, which is why we create symlinks. Our thin build backend turns those symlinks into hard links so that the LICENSE and NOTICE files contain the actual contents and are included in the sdist.
Flags that are no longer used (they were only part of setup.py) have been removed, and the new way of passing the remaining flags has been documented and validated.
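For readers unfamiliar with in-tree PEP 517 backends, the wiring for such a thin wrapper would look roughly like this in pyproject.toml (the module name and backend-path here are illustrative assumptions, not necessarily the ones used in this PR):

```toml
[build-system]
# Illustrative sketch: an in-tree backend module that wraps
# scikit-build-core. Names are assumptions, not the PR's actual values.
requires = ["scikit-build-core", "cython>=3.1", "setuptools-scm"]
build-backend = "pyarrow_build_backend"
backend-path = ["python/build_backend"]
```

backend-path tells frontends like pip and build to import the backend from the source tree itself rather than from an installed distribution.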
Are these changes tested?
Yes, all Python CI tests, wheel builds and sdist builds are successful.
Are there any user-facing changes?
Yes. Users building PyArrow will now require the new build dependencies, and depending on the flags they use they might need to pass those flags in the newly documented way.